🏁 Pole to Podium: How F1 Champions Are Made¶

QTM 151: Introduction to Statistical Computing II — Final Project¶

Reproducibility¶

All code, data and detailed documentation for this analysis are available on GitHub.
You can clone or browse the repository here:

https://github.com/yrqoeuqo123/FinalProject_QTM151.git

Feel free to explore the notebooks, CSVs and instructions there to reproduce every step of the project.

In [1]:
team_members = [
    "Amanda Middelthon",
    "Anika Chandra",
    "Cassiel Chen",
    "Tianyi Zhang"
]


decorative_line = "~" * 60
print(decorative_line)
print("🌊 TEAM MEMBERS 🌊".center(60))
print(decorative_line)
for member in team_members:
    print(f"~ {member.center(56)} ~")
print(decorative_line)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                      🌊 TEAM MEMBERS 🌊                      
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~                    Amanda Middelthon                     ~
~                      Anika Chandra                       ~
~                       Cassiel Chen                       ~
~                       Tianyi Zhang                       ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Section 1: Introduction¶

Formula 1 (F1) is the world’s premier open-wheel racing series, where cutting-edge engineering, peak athlete performance, and split-second strategy collide. Since its inaugural season in 1950, F1 has grown into a truly global spectacle—spanning five continents and challenging drivers on a dizzying array of circuits, from high-altitude mountain tracks to tight street courses. Fans and analysts alike follow every qualifying session, pit stop, and photo finish on Formula1.com and in encyclopedic resources such as Wikipedia’s Formula One page.

Modern F1 Race Car
Figure 1: A modern Ferrari F1 car in action

We’re interested not just in who wins, but where and how those victories happen—and how the sport’s competitive geography has shifted over time. To that end, we answer two focused questions:

  1. RQ1: Which nationalities have produced the most podium finishes, and how has each nationality’s podium-rate evolved by debut decade?
    Why?

    • (a) Total podium counts reveal which countries dominate F1 history.
    • (b) Podium-rate by decade uncovers shifts in global competitiveness and emerging talent markets.

    Tables: Drivers, Results, Races
    Sub-questions & Approach:

    1. Total podiums
      • Join drivers.driverId → results.driverId.
      • Cast positionOrder to integer; flag podium = (positionOrder ≤ 3).
      • Count podiums by nationality.
    2. Decade-by-decade podium-rate
      • Merge results.raceId → races.raceId to get race dates.
      • Derive each driver’s debut_decade (first appearance, rounded to the nearest decade).
      • Compute entries vs. podiums per nationality × debut_decade; calculate podium_rate = podiums / entries.
  2. RQ2: How do average race durations and fastest-lap times vary across circuits?
    Why? Circuit layout, elevation, and length all influence lap speeds and race pace—comparing these metrics highlights which tracks favor outright speed versus those rewarding endurance and strategy.
    Tables: Results, Circuits, Status
    Approach:

    1. Filter to classified finishers (status == "Finished" or matches "+n Laps").
    2. Convert milliseconds and fastestLapTime into seconds.
    3. Compute per-circuit averages for race duration, fastest-lap time, and finisher counts.

Preview of findings¶

  • Nations with vast F1 legacies—Britain and Germany—dominate raw podium totals, yet small programs like Finland and Argentina punch far above their weight with podium-per-start rates exceeding 20%.
  • Circuit analyses show that “temples of speed” such as Monza and the Red Bull Ring deliver the shortest races and quickest laps, while tighter street courses dramatically extend race time.

We first load and inspect these tables (Section 2), then perform the cleaning, merging, and calculations needed for each question (Section 3), and conclude with an interpretation of our findings (Section 4).

Section 2: Data Description¶

For our analysis, we are focusing on a curated subset of the comprehensive Formula 1 dataset, which comprises 14 tables. We have selected five tables—Drivers, Races, Results, Circuits, and Status—because they offer a multi-dimensional view of the sport and allow us to filter for classified finishers where needed.

  • Drivers: Each row represents a unique Formula 1 driver and includes essential details such as the driver's name, nationality, and date of birth. This table provides the backbone for understanding personal attributes and backgrounds.

  • Races: Captures the details of every race event—race date, round, season, and circuit linkage—spanning the sport’s evolution. We use it to derive decade-of-debut and to join race-level metadata.

  • Results: Records each driver’s performance in a given race: starting grid, finishing position (positionOrder), points, lap counts, total race time (milliseconds), fastest lap details, and a statusId. This is the core table for our performance, podium, and speed analyses.

  • Circuits: Describes each race venue—its name, location, country, and altitude—enabling us to examine how track characteristics affect average race and lap times.

  • Status: Maps statusId to finishing status strings (e.g., “Finished”, “Accident”, “+1 Lap”). We join this to Results to distinguish true finishers from DNFs or technical retirements, ensuring our circuit-speed calculations and podium denominators are based only on classified outcomes.

Figure 2: Selected datasets for our study

2.1 Library Imports and Directory Setup¶

Before beginning our analysis, we need to import several Python libraries that will help us manipulate data, perform numerical computations, create visualizations, and map circuit locations. We also set up our environment by determining the current working directory and specifying the subfolder (data_raw) where our raw data is stored.

  • Pandas: efficient data manipulation with DataFrame objects
  • NumPy: numerical operations and computations
  • Matplotlib and Seaborn: versatile plotting libraries for both basic and advanced visualizations
  • OS: interact with the operating system to manage file paths
  • Folium: generate interactive web maps for geographic analysis
  • PercentFormatter from matplotlib.ticker: format axis labels as percentages

Below is the code snippet that imports these libraries, configures our plotting settings, and sets the directory for our raw data.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from matplotlib.ticker import PercentFormatter
import folium

current_dir = os.getcwd()
data_dir = os.path.join(current_dir, 'data_raw')

# print("Current working directory:", current_dir)
# print("Data directory:", data_dir)

2.2 Loading Datasets and Counting Observations¶

In this section, we load the selected Formula 1 datasets from our specified data_raw directory. We work with four key tables:

  • Drivers: Contains information about each driver, such as name, nationality, and date of birth.
  • Races: Records details of every race event, including the race date, round, and season.
  • Results: Provides performance metrics for drivers in each race, such as finishing positions and points scored.
  • Circuits: Holds details about the race circuits, including circuit name, location, and geographic coordinates.

The following code snippet reads each dataset from its corresponding CSV file and prints out the number of rows to ensure that the data has loaded correctly.

In [3]:
# Load the datasets and count the number of observations in each
drivers = pd.read_csv(os.path.join(data_dir, 'drivers.csv'))
races = pd.read_csv(os.path.join(data_dir, 'races.csv'))
results = pd.read_csv(os.path.join(data_dir, 'results.csv'))
circuits = pd.read_csv(os.path.join(data_dir, 'circuits.csv'))
status = pd.read_csv(os.path.join(data_dir, 'status.csv'))


for df, name in [(drivers, "Drivers"), (races, "Races"),
                 (results, "Results"), (circuits, "Circuits"),
                 (status, "Status")]:
    print(f"{name}: {df.shape[0]} rows × {df.shape[1]} cols")
    display(df.head(2))
Drivers: 857 rows × 9 cols
driverId driverRef number code forename surname dob nationality url
0 1 hamilton 44 HAM Lewis Hamilton 1985-01-07 British http://en.wikipedia.org/wiki/Lewis_Hamilton
1 2 heidfeld \N HEI Nick Heidfeld 1977-05-10 German http://en.wikipedia.org/wiki/Nick_Heidfeld
Races: 1102 rows × 18 cols
raceId year round circuitId name date time url fp1_date fp1_time fp2_date fp2_time fp3_date fp3_time quali_date quali_time sprint_date sprint_time
0 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... \N \N \N \N \N \N \N \N \N \N
1 2 2009 2 2 Malaysian Grand Prix 2009-04-05 09:00:00 http://en.wikipedia.org/wiki/2009_Malaysian_Gr... \N \N \N \N \N \N \N \N \N \N
Results: 25840 rows × 18 cols
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap rank fastestLapTime fastestLapSpeed statusId
0 1 18 1 1 22 1 1 1 1 10.0 58 1:34:50.616 5690616 39 2 1:27.452 218.300 1
1 2 18 2 2 3 5 2 2 2 8.0 58 +5.478 5696094 41 3 1:27.739 217.586 1
Circuits: 77 rows × 9 cols
circuitId circuitRef name location country lat lng alt url
0 1 albert_park Albert Park Grand Prix Circuit Melbourne Australia -37.84970 144.968 10 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1 2 sepang Sepang International Circuit Kuala Lumpur Malaysia 2.76083 101.738 18 http://en.wikipedia.org/wiki/Sepang_Internatio...
Status: 139 rows × 2 cols
statusId status
0 1 Finished
1 2 Disqualified

2.3 'Drivers' Dataset Exploration¶

Table Schema

Field Type Null Key Default Description
driverId int(11) NO PRI — Primary key (auto-increment)
driverRef varchar(255) NO — Unique driver identifier
number int(11) YES — Permanent driver number
code varchar(3) YES — Driver code (e.g. “ALO”)
forename varchar(255) NO — Driver forename
surname varchar(255) NO — Driver surname
dob date YES — Driver date of birth
nationality varchar(255) YES — Driver nationality
url varchar(255) NO UNI — Wikipedia page (unique)

The Drivers table contains one row per F1 competitor (857 entries). The driverId field serves as the internal primary key, while driverRef uniquely identifies each driver externally. Permanent race numbers (number) and three-letter codes (code) are provided when available. Both forename and surname are required, ensuring clear naming, whereas dob and nationality may be null if unknown. Finally, the url column links to each driver’s Wikipedia page, offering a reliable reference for further biographical details. This schema gives us all the demographic and identity information needed for our nationality-based analyses.

Data Exploration¶

The Drivers dataset contains information for each unique Formula 1 driver, such as their name, nationality, and date of birth. In the code below, we display the first five rows, the detailed dataset information (data types and non-null counts), and summary statistics to understand the structure and composition of the data.

In [4]:
# Preview the first 5 rows of the Drivers dataset
print("Drivers dataset loaded with", drivers.shape[0], "rows.")
print("First 5 Rows of the Drivers Dataset:")
print(drivers.head())
Drivers dataset loaded with 857 rows.
First 5 Rows of the Drivers Dataset:
   driverId   driverRef number code  forename     surname         dob  \
0         1    hamilton     44  HAM     Lewis    Hamilton  1985-01-07   
1         2    heidfeld     \N  HEI      Nick    Heidfeld  1977-05-10   
2         3     rosberg      6  ROS      Nico     Rosberg  1985-06-27   
3         4      alonso     14  ALO  Fernando      Alonso  1981-07-29   
4         5  kovalainen     \N  KOV    Heikki  Kovalainen  1981-10-19   

  nationality                                             url  
0     British     http://en.wikipedia.org/wiki/Lewis_Hamilton  
1      German      http://en.wikipedia.org/wiki/Nick_Heidfeld  
2      German       http://en.wikipedia.org/wiki/Nico_Rosberg  
3     Spanish    http://en.wikipedia.org/wiki/Fernando_Alonso  
4     Finnish  http://en.wikipedia.org/wiki/Heikki_Kovalainen  

Above is a snapshot of the first five rows from the Drivers dataset, which comprises 857 records in total. Each row represents a unique Formula 1 driver and includes key information such as the internal driver ID, an external reference, the driver's racing number (which may be absent in some cases), a short driver code, first name, last name, date of birth, nationality, and a URL linking to their Wikipedia page. For instance, the first entry in this dataset corresponds to Lewis Hamilton, one of the sport’s most celebrated figures. To honor his legacy, we also include an image of him below.

Lewis Hamilton Figure 3: Picture of Lewis Hamilton - F1 Driver for Ferrari

In [5]:
# Display dataset information
print("Dataset Information:")
drivers.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857 entries, 0 to 856
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   driverId     857 non-null    int64 
 1   driverRef    857 non-null    object
 2   number       857 non-null    object
 3   code         857 non-null    object
 4   forename     857 non-null    object
 5   surname      857 non-null    object
 6   dob          857 non-null    object
 7   nationality  857 non-null    object
 8   url          857 non-null    object
dtypes: int64(1), object(8)
memory usage: 60.4+ KB

Every column in the Drivers table is fully populated, with no missing rows. The only integer field, driverId, serves as our primary key for all merges. The other eight columns are stored as strings (or dates once converted), covering identifiers (driverRef), personal details (forename, surname, dob, nationality), race metadata (number, code), and reference links (url). Several of these—particularly number, code, dob, and nationality—use the placeholder "\N" for missing entries. In the next step we’ll replace "\N" with NaN and cast each column to its proper type so that null values are handled correctly in our analyses.

Note on Summary Statistics: Running a summary statistics table on the Drivers table would yield little insight because most fields are unique IDs or high‐cardinality strings. Instead, we’ll generate two focused summaries that directly inform our analysis: a nationality frequency table to show which countries are most represented on the grid and a count of how many drivers have a permanent number versus missing entries, revealing how many carry a fixed race number.

In [6]:
nat_counts = drivers['nationality'].value_counts()
top_nat = nat_counts.head(10)

plt.figure(figsize=(9, 5))
bars = plt.barh(top_nat.index[::-1], top_nat.values[::-1])
plt.xlabel('Number of Drivers')
plt.title('Top 10 Driver Nationalities')
plt.grid(axis='x', linestyle='--', alpha=0.6)

for bar in bars:
    width = bar.get_width()
    plt.text(width + 1, bar.get_y() + bar.get_height() / 2, str(int(width)), va='center')

plt.tight_layout()
plt.show()
No description has been provided for this image

This chart shows the all-time headcount of Formula 1 drivers by nationality. The U.K. tops the list with 165 drivers, narrowly ahead of the U.S. at 158—underscoring both countries’ deep motorsport infrastructures and feeder series. Italy (99) and France (73) follow as the next most prolific talent pools, with Germany (50) rounding out the top five. Beyond Europe’s “Big Four,” Brazil (32) and Argentina (24) reflect South America’s rich F1 heritage, while Belgium, Switzerland, and South Africa each contributed 23 drivers. The sharp decline after the leading nations illustrates how a small group of countries has historically dominated F1’s driver pipeline.

In [7]:
# replace "\\N" with NaN so we can count the number of drivers with and without a permanent number
drivers['number'] = drivers['number'].replace("\\N", pd.NA)

assigned = drivers['number'].notna().sum()
missing  = drivers['number'].isna().sum()

print(f"Drivers with a permanent number: {assigned}")
print(f"Drivers without a permanent number: {missing}")
Drivers with a permanent number: 54
Drivers without a permanent number: 803

Only 54 drivers have a non-null number because permanent driver numbers weren’t introduced until 2014. Before then, competitors used whatever number was assigned each weekend, so all pre-2014 drivers (and any who never opted in) appear as missing (NaN), accounting for the 803 without a permanent number.

2.4 'Races' Dataset Exploration¶

Races Table Schema¶

From the Ergast codebook, the races table is defined with 18 columns:

Field Type Null Key Default Description
raceId int(11) NO PRI — Primary key (auto-increment)
year int(11) NO 0 Championship season (e.g. 1950)
round int(11) NO 0 Sequence number within the season
circuitId int(11) NO 0 Foreign key → circuits.circuitId
name varchar(255) NO — Official race name (e.g. “Monaco Grand Prix”)
date date NO 0000-00-00 Race date (ISO format)
time time YES — Scheduled start time
url varchar(255) YES UNI — Wikipedia link for the event
fp1_date date YES — Free Practice 1 date
fp1_time time YES — Free Practice 1 start time
fp2_date date YES — Free Practice 2 date
fp2_time time YES — Free Practice 2 start time
fp3_date date YES — Free Practice 3 date
fp3_time time YES — Free Practice 3 start time
quali_date date YES — Qualifying session date
quali_time time YES — Qualifying start time
sprint_date date YES — Sprint race date
sprint_time time YES — Sprint race start time
  • The first seven fields (raceId through time) are all non-nullable and form the core metadata we need for merging and analysis.
  • All session-specific fields (fp*, quali*, sprint*) allow NULL and are often unused in early seasons; we can ignore or clean these when focusing on main race data.

This schema confirms that our primary keys and join fields are complete and ready for the merges required by our three focused research questions.

Data Exploration¶

The Races dataset captures one row per Grand Prix and includes key scheduling and identification fields—such as raceId, year, round, circuitId, name, date, and time. In the code below, we display the first five rows to see the earliest events, call .info() to check data types and non‑null counts, and use .describe(include='all') to review summary statistics across both numeric and categorical columns.

In [8]:
# Preview the first 5 rows of the Races dataset
print("Races dataset loaded with", races.shape[0], "rows.")
print("First 5 Rows of the Races Dataset:")
print(races.head(), "\n")
Races dataset loaded with 1102 rows.
First 5 Rows of the Races Dataset:
   raceId  year  round  circuitId                   name        date  \
0       1  2009      1          1  Australian Grand Prix  2009-03-29   
1       2  2009      2          2   Malaysian Grand Prix  2009-04-05   
2       3  2009      3         17     Chinese Grand Prix  2009-04-19   
3       4  2009      4          3     Bahrain Grand Prix  2009-04-26   
4       5  2009      5          4     Spanish Grand Prix  2009-05-10   

       time                                                url fp1_date  \
0  06:00:00  http://en.wikipedia.org/wiki/2009_Australian_G...       \N   
1  09:00:00  http://en.wikipedia.org/wiki/2009_Malaysian_Gr...       \N   
2  07:00:00  http://en.wikipedia.org/wiki/2009_Chinese_Gran...       \N   
3  12:00:00  http://en.wikipedia.org/wiki/2009_Bahrain_Gran...       \N   
4  12:00:00  http://en.wikipedia.org/wiki/2009_Spanish_Gran...       \N   

  fp1_time fp2_date fp2_time fp3_date fp3_time quali_date quali_time  \
0       \N       \N       \N       \N       \N         \N         \N   
1       \N       \N       \N       \N       \N         \N         \N   
2       \N       \N       \N       \N       \N         \N         \N   
3       \N       \N       \N       \N       \N         \N         \N   
4       \N       \N       \N       \N       \N         \N         \N   

  sprint_date sprint_time  
0          \N          \N  
1          \N          \N  
2          \N          \N  
3          \N          \N  
4          \N          \N   

The code snippet above displays the first five rows of the Races table, confirming that each row represents one Grand Prix event. In our sample, we see the opening five races of the 2009 season (Australian, Malaysian, Chinese, Bahrain, Spanish). Key columns—raceId, year, round, circuitId, name, date, time, and url—are all present and properly formatted. This quick look verifies that our core scheduling and identification fields loaded correctly and are ready for the merges required by our analysis.

In [9]:
print("Dataset Information:")
races.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1102 entries, 0 to 1101
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   raceId       1102 non-null   int64 
 1   year         1102 non-null   int64 
 2   round        1102 non-null   int64 
 3   circuitId    1102 non-null   int64 
 4   name         1102 non-null   object
 5   date         1102 non-null   object
 6   time         1102 non-null   object
 7   url          1102 non-null   object
 8   fp1_date     1102 non-null   object
 9   fp1_time     1102 non-null   object
 10  fp2_date     1102 non-null   object
 11  fp2_time     1102 non-null   object
 12  fp3_date     1102 non-null   object
 13  fp3_time     1102 non-null   object
 14  quali_date   1102 non-null   object
 15  quali_time   1102 non-null   object
 16  sprint_date  1102 non-null   object
 17  sprint_time  1102 non-null   object
dtypes: int64(4), object(14)
memory usage: 155.1+ KB

The output shows 1,102 total entries (index 0–1101) and 18 columns, using about 155 KB of memory. Four integer columns (raceId, year, round, circuitId) are fully populated with no missing values, providing robust join keys and season metadata. The other 14 object-type columns include race names, dates/times, and URLs, while the practice (fp1_*, fp2_*, fp3_*), qualifying (quali_*), and sprint (sprint_*) fields uniformly contain the placeholder “\N” where session data is unavailable. Overall, the completeness of the core fields and clearly marked placeholders for secondary sessions satisfy our assignment’s requirement to document data structure and readiness for analysis.

In [10]:
print("\nSummary Statistics:")
print(races.describe(include='all'))
Summary Statistics:
             raceId         year        round    circuitId  \
count   1102.000000  1102.000000  1102.000000  1102.000000   
unique          NaN          NaN          NaN          NaN   
top             NaN          NaN          NaN          NaN   
freq            NaN          NaN          NaN          NaN   
mean     553.853902  1992.049002     8.507260    23.697822   
std      321.706151    20.308639     5.097545    19.337398   
min        1.000000  1950.000000     1.000000     1.000000   
25%      276.250000  1976.000000     4.000000     9.000000   
50%      551.500000  1994.000000     8.000000    18.000000   
75%      826.750000  2010.000000    12.000000    34.000000   
max     1120.000000  2023.000000    23.000000    80.000000   

                      name        date  time  \
count                 1102        1102  1102   
unique                  54        1102    34   
top     Italian Grand Prix  2009-03-29    \N   
freq                    74           1   731   
mean                   NaN         NaN   NaN   
std                    NaN         NaN   NaN   
min                    NaN         NaN   NaN   
25%                    NaN         NaN   NaN   
50%                    NaN         NaN   NaN   
75%                    NaN         NaN   NaN   
max                    NaN         NaN   NaN   

                                                      url fp1_date fp1_time  \
count                                                1102     1102     1102   
unique                                               1102       68       19   
top     http://en.wikipedia.org/wiki/2009_Australian_G...       \N       \N   
freq                                                    1     1035     1057   
mean                                                  NaN      NaN      NaN   
std                                                   NaN      NaN      NaN   
min                                                   NaN      NaN      NaN   
25%                                                   NaN      NaN      NaN   
50%                                                   NaN      NaN      NaN   
75%                                                   NaN      NaN      NaN   
max                                                   NaN      NaN      NaN   

       fp2_date fp2_time fp3_date fp3_time quali_date quali_time sprint_date  \
count      1102     1102     1102     1102       1102       1102        1102   
unique       68       16       56       17         68         13          13   
top          \N       \N       \N       \N         \N         \N          \N   
freq       1035     1057     1047     1066       1035       1057        1090   
mean        NaN      NaN      NaN      NaN        NaN        NaN         NaN   
std         NaN      NaN      NaN      NaN        NaN        NaN         NaN   
min         NaN      NaN      NaN      NaN        NaN        NaN         NaN   
25%         NaN      NaN      NaN      NaN        NaN        NaN         NaN   
50%         NaN      NaN      NaN      NaN        NaN        NaN         NaN   
75%         NaN      NaN      NaN      NaN        NaN        NaN         NaN   
max         NaN      NaN      NaN      NaN        NaN        NaN         NaN   

       sprint_time  
count         1102  
unique           6  
top             \N  
freq          1093  
mean           NaN  
std            NaN  
min            NaN  
25%            NaN  
50%            NaN  
75%            NaN  
max            NaN  
In [11]:
# Compute number of races per year
races_per_year = races['year'].value_counts().sort_index()

# Improved aesthetics
plt.figure(figsize=(10, 5))
plt.plot(races_per_year.index, races_per_year.values, marker='o', linewidth=2)
plt.grid(True, linestyle='--', alpha=0.6)
plt.xlabel('Year')
plt.ylabel('Number of Races')
plt.title('Formula 1 Calendar Size by Year')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
No description has been provided for this image

The plot shows how the F1 calendar has steadily expanded since 1950. In the inaugural season there were just seven races; through the 1960s it crept into the low‑10s before jumping to around 15–16 events by the mid‑1970s. From the 1980s through the 1990s the schedule held fairly constant at roughly 16 Grands Prix per year, then climbed again in the 2000s and 2010s into the high‑teens and low‑20s. You can also spot the COVID‑related dip in 2020 (down to 17 races), followed by a post‑pandemic surge peaking at 23 races in 2023. Overall, this upward trend underscores F1’s continual global growth and growing commercial appeal.

2.5 'Results' Dataset Exploration¶

Results Table Schema¶

Field Type Null Key Default Description
resultId int(11) NO PRI — Primary key (auto-increment)
raceId int(11) NO 0 Foreign key → races.raceId
driverId int(11) NO 0 Foreign key → drivers.driverId
constructorId int(11) NO 0 Foreign key → constructors.constructorId
number int(11) YES — Driver number for that race
grid int(11) NO 0 Starting grid position
position int(11) YES — Official finishing classification (if applicable)
positionText varchar(255) NO — Textual position (e.g. "1" or "R")
positionOrder int(11) NO 0 Numeric position for ordering/comparison
points float NO 0 Championship points scored in that race
laps int(11) NO 0 Number of laps completed
time varchar(255) YES — Finishing time or gap (string)
milliseconds int(11) YES — Finishing time in milliseconds
fastestLap int(11) YES — Lap number on which the driver set their fastest lap
rank int(11) YES 0 Rank of that fastest lap against other drivers
fastestLapTime varchar(255) YES — Fastest lap time (e.g. "1:27.453")
fastestLapSpeed varchar(255) YES — Fastest lap speed (km/h, e.g. "213.874")
statusId int(11) NO 0 Foreign key → status.statusId (finish status code)

This Results table captures every driver’s outcome in each race: who they drove for (constructorId), where they started (grid), where they finished (position/positionOrder and points), and their fastest lap details (fastestLap, fastestLapTime, rank, fastestLapSpeed). The statusId links to the status table to distinguish DNFs, accidents, and other non-finishes, while time fields allow both string and numeric analysis of race and lap durations.

Data Exploration¶

The Results dataset catalogs driver performance metrics for each race (e.g., finishing positions, points scored). The exploration code below examines this dataset by displaying its first five rows, detailed dataset information, and summary statistics, providing insights into the various performance metrics available.

In [12]:
# Preview the first 5 rows of the Results dataset
print("Results dataset loaded with", results.shape[0], "rows.")
print("First 5 Rows of the Results Dataset:")
print(results.head())
Results dataset loaded with 25840 rows.
First 5 Rows of the Results Dataset:
   resultId  raceId  driverId  constructorId number  grid position  \
0         1      18         1              1     22     1        1   
1         2      18         2              2      3     5        2   
2         3      18         3              3      7     7        3   
3         4      18         4              4      5    11        4   
4         5      18         5              1     23     3        5   

  positionText  positionOrder  points  laps         time milliseconds  \
0            1              1    10.0    58  1:34:50.616      5690616   
1            2              2     8.0    58       +5.478      5696094   
2            3              3     6.0    58       +8.163      5698779   
3            4              4     5.0    58      +17.181      5707797   
4            5              5     4.0    58      +18.014      5708630   

  fastestLap rank fastestLapTime fastestLapSpeed  statusId  
0         39    2       1:27.452         218.300         1  
1         41    3       1:27.739         217.586         1  
2         41    5       1:28.090         216.719         1  
3         58    7       1:28.603         215.464         1  
4         43    1       1:27.418         218.385         1  

The Results table contains 25,840 entries—each one recording a single driver’s performance in a particular Grand Prix. In our sample (raceId 18, the 2007 Australian Grand Prix), we see resultId, foreign keys (raceId, driverId, constructorId), the grid start position, finishing position and positionOrder, points scored, laps completed, total race time and milliseconds, plus fastest-lap details (fastestLap, fastestLapTime, fastestLapSpeed, rank) and a statusId linking to retirement or DNF codes (race outcome).

In [13]:
print("Dataset Information:")
results.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25840 entries, 0 to 25839
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   resultId         25840 non-null  int64  
 1   raceId           25840 non-null  int64  
 2   driverId         25840 non-null  int64  
 3   constructorId    25840 non-null  int64  
 4   number           25840 non-null  object 
 5   grid             25840 non-null  int64  
 6   position         25840 non-null  object 
 7   positionText     25840 non-null  object 
 8   positionOrder    25840 non-null  int64  
 9   points           25840 non-null  float64
 10  laps             25840 non-null  int64  
 11  time             25840 non-null  object 
 12  milliseconds     25840 non-null  object 
 13  fastestLap       25840 non-null  object 
 14  rank             25840 non-null  object 
 15  fastestLapTime   25840 non-null  object 
 16  fastestLapSpeed  25840 non-null  object 
 17  statusId         25840 non-null  int64  
dtypes: float64(1), int64(8), object(9)
memory usage: 3.5+ MB

The Results table contains 25,840 rows across 18 columns. Most fields are complete and appropriately typed—eight are integers (e.g., raceId, grid, positionOrder), while others like number, position, time, and lap statistics are stored as object due to mixed formatting or placeholder values (e.g., "\N"). Several of these columns—such as milliseconds, fastestLapTime, and fastestLapSpeed—will benefit from cleaning and type conversion for numerical analysis. The final column, statusId, links to the Status table (see schema above), allowing us to interpret whether a driver finished, retired, or encountered a mechanical issue.

In [14]:
# 1. Replace "\N" placeholders with NaN
results = results.replace("\\N", pd.NA)

# 2. Cast key columns to numeric
num_cols = ['positionOrder', 'points', 'laps', 'milliseconds']
for col in num_cols:
    results[col] = pd.to_numeric(results[col], errors='coerce')

# 3. Parse times into seconds

# 3a. Race time in seconds
results['race_time_s'] = results['milliseconds'] / 1000

# 3b. Fastest lap time in seconds (e.g. "1:27.452" → 87.452)
def lap_time_to_seconds(x):
    if pd.isna(x):
        return np.nan
    minutes, seconds = x.split(':')
    return int(minutes) * 60 + float(seconds)

results['fastestLap_s'] = results['fastestLapTime'].apply(lap_time_to_seconds)


summary = results[['positionOrder', 'points', 'laps', 'race_time_s', 'fastestLap_s']].describe()
print(summary)
       positionOrder        points          laps   race_time_s  fastestLap_s
count   25840.000000  25840.000000  25840.000000   7087.000000   7379.000000
mean       12.876006      1.877053     45.977515   6231.870431     91.112699
std         7.712391      4.169849     29.808951   1678.932890     12.505799
min         1.000000      0.000000      0.000000    207.071000     55.404000
25%         6.000000      0.000000     22.000000   5413.270500     80.908500
50%        12.000000      0.000000     52.000000   5814.618000     90.512000
75%        18.000000      2.000000     66.000000   6426.263500     99.921500
max        39.000000     50.000000    200.000000  15090.540000    202.300000

2.6 'Circuits' Dataset Exploration¶

Circuits Table Schema¶

Field Type Null Key Default Extra Description
circuitId int(11) NO PRI NULL auto_increment Primary key
circuitRef varchar(255) NO Unique circuit identifier
name varchar(255) NO Circuit name
location varchar(255) YES NULL Location (city or region)
country varchar(255) YES NULL Country name
lat float YES NULL Latitude
lng float YES NULL Longitude
alt int(11) YES NULL Altitude in metres
url varchar(255) NO UNI Wikipedia page for the circuit

The Circuits table contains 77 records, one for each Formula 1 venue. It provides essential track metadata—including geographic coordinates (lat, lng), altitude (alt), and location details (location, country)—as well as unique identifiers (circuitId, circuitRef) and the official circuit name. The url field links to each circuit’s Wikipedia page for reference. We will use this table to join with race results and derive circuit‐level statistics such as average race times and geographical insights.

Data Exploration¶

The Circuits dataset contains data on race circuits, including circuit names, locations, and geographic coordinates. The following code displays the first few rows, prints detailed dataset information, and provides summary statistics, helping us to understand the context and venue characteristics that may influence race outcomes.

In [15]:
# Preview the first 5 rows of the Circuits dataset
print("Circuits dataset loaded with", circuits.shape[0], "rows.")
print("First 5 Rows of the Circuits Dataset:")
print(circuits.head(), "\n")
Circuits dataset loaded with 77 rows.
First 5 Rows of the Circuits Dataset:
   circuitId   circuitRef                            name      location  \
0          1  albert_park  Albert Park Grand Prix Circuit     Melbourne   
1          2       sepang    Sepang International Circuit  Kuala Lumpur   
2          3      bahrain   Bahrain International Circuit        Sakhir   
3          4    catalunya  Circuit de Barcelona-Catalunya      Montmeló   
4          5     istanbul                   Istanbul Park      Istanbul   

     country       lat        lng  alt  \
0  Australia -37.84970  144.96800   10   
1   Malaysia   2.76083  101.73800   18   
2    Bahrain  26.03250   50.51060    7   
3      Spain  41.57000    2.26111  109   
4     Turkey  40.95170   29.40500  130   

                                                 url  
0  http://en.wikipedia.org/wiki/Melbourne_Grand_P...  
1  http://en.wikipedia.org/wiki/Sepang_Internatio...  
2  http://en.wikipedia.org/wiki/Bahrain_Internati...  
3  http://en.wikipedia.org/wiki/Circuit_de_Barcel...  
4         http://en.wikipedia.org/wiki/Istanbul_Park   

The Circuits table contains 77 entries—one for each F1 venue. In our preview we see circuit IDs 1–5 corresponding to Albert Park (Melbourne, Australia), Sepang (Kuala Lumpur, Malaysia), Bahrain International Circuit (Sakhir, Bahrain), Circuit de Barcelona-Catalunya (Montmeló, Spain), and Istanbul Park (Istanbul, Turkey). Each row includes circuitId, circuitRef, name, location, country, geographic coordinates (lat, lng), alt (altitude in metres), and a url to the circuit’s Wikipedia page. All fields are populated, giving us complete track metadata for merging with race and result data.

In [16]:
print("Dataset Information:")
circuits.info()
Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   circuitId   77 non-null     int64  
 1   circuitRef  77 non-null     object 
 2   name        77 non-null     object 
 3   location    77 non-null     object 
 4   country     77 non-null     object 
 5   lat         77 non-null     float64
 6   lng         77 non-null     float64
 7   alt         77 non-null     object 
 8   url         77 non-null     object 
dtypes: float64(2), int64(1), object(6)
memory usage: 5.5+ KB

Although the Circuits schema allows location, country, lat, lng, and alt to be null, our loaded table contains 77 records and 9 columns with no missing values. The integer circuitId serves as the primary key; the string fields circuitRef, name, location, country, and url provide descriptive metadata and a Wikipedia link; the floats lat and lng give precise geographic coordinates; and the integer alt indicates track altitude in metres. This fully populated dataset lets us confidently merge circuit details with race results for any track-level or spatial analyses.

In [17]:
print("\nSummary Statistics:")
print(circuits.describe(include='all'))
Summary Statistics:
        circuitId   circuitRef                            name   location  \
count   77.000000           77                              77         77   
unique        NaN           77                              77         75   
top           NaN  albert_park  Albert Park Grand Prix Circuit  Barcelona   
freq          NaN            1                               1          2   
mean    39.883117          NaN                             NaN        NaN   
std     23.001701          NaN                             NaN        NaN   
min      1.000000          NaN                             NaN        NaN   
25%     20.000000          NaN                             NaN        NaN   
50%     40.000000          NaN                             NaN        NaN   
75%     59.000000          NaN                             NaN        NaN   
max     80.000000          NaN                             NaN        NaN   

       country        lat         lng  alt  \
count       77  77.000000   77.000000   77   
unique      35        NaN         NaN   66   
top        USA        NaN         NaN   18   
freq        11        NaN         NaN    3   
mean       NaN  33.442925    1.076683  NaN   
std        NaN  22.808866   65.516951  NaN   
min        NaN -37.849700 -118.189000  NaN   
25%        NaN  32.777400   -9.394170  NaN   
50%        NaN  40.951700    3.930830  NaN   
75%        NaN  46.958900   19.248600  NaN   
max        NaN  57.265300  144.968000  NaN   

                                                      url  
count                                                  77  
unique                                                 77  
top     http://en.wikipedia.org/wiki/Melbourne_Grand_P...  
freq                                                    1  
mean                                                  NaN  
std                                                   NaN  
min                                                   NaN  
25%                                                   NaN  
50%                                                   NaN  
75%                                                   NaN  
max                                                   NaN  

Summary statistics for lat, lng, and alt offer limited insight. Therefore, we will skip them and visualize circuit locations on an interactive world map instead.

In [18]:
import folium
from IPython.display import IFrame

circuits['lat'] = pd.to_numeric(circuits['lat'], errors='coerce')
circuits['lng'] = pd.to_numeric(circuits['lng'], errors='coerce')

m = folium.Map(location=[20, 0], zoom_start=2, control_scale=True)

for _, row in circuits.iterrows():
    folium.Marker(
        location=[row['lat'], row['lng']],
        tooltip=row['name'],
        popup=folium.Popup(
            html=f"<b>{row['name']}</b><br>{row['location']}, {row['country']}",
            max_width=250
        )
    ).add_to(m)



out_file = "all_f1_circuits.html"        
m.save(out_file)
print(f"✅  Interactive map saved to: {out_file}")

display(m)
✅  Interactive map saved to: all_f1_circuits.html
Make this Notebook Trusted to load map: File -> Trust Notebook

That map makes it immediately clear that Formula 1 is still very much a Northern‐Hemisphere, Western‐European–centric championship, with a heavy cluster of circuits across the UK, Germany, Italy, France and their neighbors. There are secondary hubs in North America (the U.S. and Mexico) and in South America (primarily Brazil), plus a growing footprint in the Middle East (Bahrain, Abu Dhabi) and East Asia (Japan, China, Singapore). Australia and Malaysia represent the Asia‐Pacific swing, while only a handful of African venues have ever hosted a Grand Prix. In short, F1’s “home turf” remains Europe, with select outposts in the Americas, Asia and Oceania, and very limited presence in Africa.

2.7 Data Preparation¶

Before diving into our two research questions, we need to stitch together and clean the raw tables from the Ergast database. Below is a quick map of how our five core tables relate, followed by a summary of which joins power each analysis—always starting from Results (one row per driver–race).

Table Relationships¶

  1. Drivers (driverId) ← Results (driverId)

    • Every result record refers to the driver who achieved that grid slot, finishing position, fastest lap, etc.
    • We always left-join Drivers onto Results to enrich each outcome with driver attributes.
  2. Races (raceId) ← Results (raceId)

    • Each performance row is tied back to a specific Grand Prix.
    • We left-join Races onto Results to pull in race metadata (e.g. year, circuitId, date).
  3. Circuits (circuitId) ← Races (circuitId)

    • Each race takes place at a known circuit, letting us bring in track metadata (name, location, country).
    • We left-join Circuits onto our race-enriched results.
  4. Status (statusId) ← Results (statusId)

    • We use Status to filter Results down to classified finishers (e.g. “Finished” or “+n Laps”).
    • We inner-join Status and then filter on the status text.

Scope of each analysis

  • RQ1: Podium by Nationality & Trends Over Time

    • Join path:
      1. Start with Results
      2. left-join Drivers on driverId (bring in nationality)
      3. left-join Races on raceId (bring in date)
      4. Derive each driver’s debut_decade from their first race date
      5. Cast positionOrder to integer and flag podium = (positionOrder ≤ 3)
    • Goals:
      1. Total podiums: count all podium events by nationality.
      2. Decade trends: for each nationality×debut_decade, compute total entries, total podiums, and podium-rate = podiums ÷ entries.
  • RQ2: Circuit Speed

    • Join path:
      1. Start with Results
      2. inner-join Status on statusId (keep only classified finishers)
      3. left-join Races on raceId → pull in circuitId
      4. left-join Circuits on circuitId
      5. Convert milliseconds and fastestLapTime to seconds, then compute per-circuit averages.
    • Goal: Compare average race durations and fastest-lap times across tracks.

With this clear join strategy—and by selecting only the columns we need for each question—our final rq1 and circuit_speed DataFrames remain focused, performant, and ready for analysis.

RQ1: Podium by Nationality & Trends Over Time¶

Research question:
Which nationalities have produced the most podium finishes, and how has each nationality’s podium-rate evolved over debut decade?

Data cleaning
To prepare our podium analysis and compute each driver’s debut decade, we first clean and augment the raw results and races tables. We replace the "\N" placeholders with proper missing values, cast positionOrder to numeric, flag any top-three finish as podium, merge in each race’s date, convert it to pandas datetime, then for each driverId find the minimum date as debut_date and derive debut_year and debut_decade.

In [19]:
# 1. Normalize placeholders and cast positionOrder
results_rq1 = (
    results
    .replace("\\N", pd.NA)
    .assign(positionOrder=lambda df: pd.to_numeric(df.positionOrder, errors="coerce"))
)
# 2. Flag top-three finishes
results_rq1["podium"] = results_rq1.positionOrder <= 3

# 3. Bring in race dates and convert to datetime
results_dates = results_rq1.merge(
    races[["raceId", "date"]],
    on="raceId", how="left"
)
results_dates["date"] = pd.to_datetime(results_dates.date)

# 4. Compute each driver’s debut_date and debut_decade
debut = (
    results_dates
    .groupby("driverId")["date"]
    .min()
    .reset_index(name="debut_date")
    .assign(
        debut_year=lambda df: df.debut_date.dt.year,
        debut_decade=lambda df: (df.debut_date.dt.year // 10) * 10
    )
)

print ("Driver debut dates:")
print(debut.head(5))
Driver debut dates:
   driverId debut_date  debut_year  debut_decade
0         1 2007-03-18        2007          2000
1         2 2000-03-12        2000          2000
2         3 2006-03-12        2006          2000
3         4 2001-03-04        2001          2000
4         5 2007-03-18        2007          2000

Data merging
With podium flags and debut decades in hand, we now join these features back to the Drivers table to pull in each driver’s nationality. A left-join preserves every driver–race record, and we then drop any rows missing nationality or debut_decade. The resulting rq1 DataFrame has one row per driver–race with columns nationality, debut_decade, and podium, ready for total and decade-by-decade aggregation.

In [20]:
# 1. Enrich Drivers with debut_decade
drivers_debut = drivers.merge(
    debut[["driverId", "debut_decade"]],
    on="driverId", how="left"
)

# 2. Build rq1 by merging nationality and dropping any missing
rq1 = (
    results_dates[["driverId", "podium"]]
    .merge(
        drivers_debut[["driverId", "nationality", "debut_decade"]],
        on="driverId", how="left"
    )
    .dropna(subset=["nationality", "debut_decade"])
)
print(rq1.head(5))
   driverId  podium nationality  debut_decade
0         1    True     British        2000.0
1         2    True      German        2000.0
2         3    True      German        2000.0
3         4   False     Spanish        2000.0
4         5   False     Finnish        2000.0

RQ2: Circuit Speed¶

Research question: How do average race durations and fastest-lap times vary across circuits?

Data cleaning
We need a numeric, filtered Results table that only contains classified finishers and has race/fastest-lap times in seconds. Here’s what we do:

  1. Replace all "\N" placeholders with pd.NA.
  2. Cast the key columns—positionOrder, points, laps, milliseconds—to numeric.
  3. Compute
    • race_time_s = milliseconds / 1000
    • fastestLap_s = convert "M:SS.sss" strings in fastestLapTime to total seconds
  4. Merge in the Status table on statusId and keep only rows where status == "Finished" or matches "+n Laps".
In [21]:
# helper to convert "M:SS.xxx" or "H:MM:SS.xxx" to seconds
def lap_time_to_seconds(ts):
    if pd.isna(ts):
        return pd.NA
    parts = list(map(float, ts.split(':')))
    if len(parts) == 3:  # H:MM:SS.xxx
        h, m, s = parts
        return h*3600 + m*60 + s
    m, s = parts        # MM:SS.xxx
    return m*60 + s

# -----------------------------------------------------------------------------
# 1. Data cleaning
# -----------------------------------------------------------------------------
rq2 = (results
       # only keep the fields we’ll need
       [['raceId','statusId','milliseconds','fastestLapTime']]
       # normalize missing placeholders
       .replace(r'\\N', pd.NA, regex=True)
      )

# cast & convert
rq2['race_time_s']  = pd.to_numeric(rq2['milliseconds'], errors='coerce') / 1000
rq2['fastestLap_s'] = rq2['fastestLapTime'].apply(lap_time_to_seconds)

print("RQ2 sample:")
print(rq2.head())

print("Missing in aggregated metrics:")
print(rq2.isna().sum())
RQ2 sample:
   raceId  statusId  milliseconds fastestLapTime  race_time_s fastestLap_s
0      18         1     5690616.0       1:27.452     5690.616       87.452
1      18         1     5696094.0       1:27.739     5696.094       87.739
2      18         1     5698779.0       1:28.090     5698.779        88.09
3      18         1     5707797.0       1:28.603     5707.797       88.603
4      18         1     5708630.0       1:27.418     5708.630       87.418
Missing in aggregated metrics:
raceId                0
statusId              0
milliseconds      18753
fastestLapTime    18461
race_time_s       18753
fastestLap_s      18461
dtype: int64
raceId                0
statusId              0
milliseconds      18753
fastestLapTime    18461
race_time_s       18753
fastestLap_s      18461
dtype: int64

Merging procedures¶

Next, we bring in circuit metadata via the Races table and prepare to aggregate per‐track:

  1. Inner-join rq2_clean with races[['raceId','circuitId']] on raceId.
  2. Left-join to circuits[['circuitId','name','country']] on circuitId.
  3. Drop any rows missing race_time_s or fastestLap_s.
  4. Group by circuitId, name, and country to calculate:
    • avg_race_time_s = mean of race_time_s
    • avg_fastest_lap_s = mean of fastestLap_s
    • finishers = count of race_time_s
In [22]:
# a) keep only classified finishes
valid_status = status.loc[
    status['status'].eq('Finished') |
    status['status'].str.match(r'^\+\d+ Laps$'),
    ['statusId']
]
rq2 = rq2.merge(valid_status, on='statusId', how='inner')

# b) bring in circuitId
rq2 = rq2.merge(races[['raceId','circuitId']], on='raceId', how='left')

# c) bring in circuit metadata
rq2 = rq2.merge(
    circuits[['circuitId','name','country']],
    on='circuitId', how='left'
)

# d) drop any rows still missing our two time metrics
rq2 = rq2.dropna(subset=['race_time_s','fastestLap_s'])

# e) aggregate per circuit
circuit_speed = (
    rq2
    .groupby(['circuitId','name','country'], as_index=False)
    .agg(
        avg_race_time_s   = ('race_time_s','mean'),
        avg_fastest_lap_s = ('fastestLap_s','mean'),
        finishers         = ('race_time_s','size')
    )
)

print("Missing in aggregated metrics:")
print(circuit_speed[['avg_race_time_s','avg_fastest_lap_s','finishers']].isna().sum())

print("RQ2 sample after cleaning:")
print(circuit_speed.head())
Missing in aggregated metrics:
avg_race_time_s      0
avg_fastest_lap_s    0
finishers            0
dtype: int64
RQ2 sample after cleaning:
   circuitId                            name    country  avg_race_time_s  \
0          1  Albert Park Grand Prix Circuit  Australia      5538.661404   
1          2    Sepang International Circuit   Malaysia      6095.548955   
2          3   Bahrain International Circuit    Bahrain      5903.644440   
3          4  Circuit de Barcelona-Catalunya      Spain      5788.074926   
4          5                   Istanbul Park     Turkey      5409.247915   

  avg_fastest_lap_s  finishers  
0         88.387699        156  
1         98.898157        134  
2         93.877742        209  
3         83.458615        135  
4         89.991787         94  

2.8 Descriptive Statistics¶

With our rq1 and circuit_speed tables now cleaned and merged, let’s get a quick overview of what’s inside:

  • Missing values

    • Verify that our main metrics (e.g. podium, debut_decade, avg_race_time_s, avg_fastest_lap_s) have zero missing entries.
    • Scan every column in each table for any remaining gaps.
  • Targeted summaries

    • For RQ1, compute basic statistics (count, mean, std, min, 25%, 50%, 75%, max) on podium and any other key flags or counts.
    • For RQ2, compute the same set of statistics on avg_race_time_s, avg_fastest_lap_s, and finishers.
  • Full-table overview

    • Run .describe(include='all') on each DataFrame (rq1, circuit_speed) to inspect unique counts, top categories, and overall distributions for every field we’ll use downstream.

These steps will surface any outliers, skewed distributions, or unexpected quirks, so we can proceed to our visualizations and tests with confidence.

RQ1: Podium by Nationality¶

In [23]:
print("Missing values per column:")
print(rq1[['nationality', 'debut_decade', 'podium']].isna().sum(), "\n")

print("Podium flag summary:")
print(rq1['podium'].describe(), "\n")

print("Debut decade distribution:")
print(rq1['debut_decade'].value_counts().sort_index(), "\n")

print("Number of entries by nationality (top 10):")
print(rq1['nationality'].value_counts().head(10), "\n")

print("Full overview of rq1:")
print(rq1.describe(include='all').T)
Missing values per column:
nationality     0
debut_decade    0
podium          0
dtype: int64 

Podium flag summary:
count     25840
unique        2
top       False
freq      22581
Name: podium, dtype: object 

Debut decade distribution:
debut_decade
1950.0    2991
1960.0    2023
1970.0    4756
1980.0    4617
1990.0    4007
2000.0    4182
2010.0    3067
2020.0     197
Name: count, dtype: int64 

Number of entries by nationality (top 10):
nationality
British       4418
Italian       3418
French        3004
German        2384
Brazilian     1953
American      1279
Finnish       1147
Spanish        822
Australian     821
Austrian       690
Name: count, dtype: int64 

Full overview of rq1:
                count unique      top   freq         mean         std     min  \
driverId      25840.0    NaN      NaN    NaN   261.732082  268.623016     1.0   
podium          25840      2    False  22581          NaN         NaN     NaN   
nationality     25840     42  British   4418          NaN         NaN     NaN   
debut_decade  25840.0    NaN      NaN    NaN  1981.774381   18.709005  1950.0   

                 25%     50%     75%     max  
driverId        56.0   163.0   360.0   856.0  
podium           NaN     NaN     NaN     NaN  
nationality      NaN     NaN     NaN     NaN  
debut_decade  1970.0  1980.0  2000.0  2020.0  

Our cleaned rq1 table now holds 25,840 complete records—each with nationality, debut_decade and a podium flag.

  • Podium breakdown:

    • 3,259 podiums (12.6%)
    • 22,581 non-podiums (87.4%)
  • Debut decades (entries by decade):

    • 1950s: 2,991
    • 1960s: 2,023
    • 1970s: 4,756
    • 1980s: 4,617
    • 1990s: 4,007
    • 2000s: 4,182
    • 2010s: 3,067
    • 2020s: 197
  • Nationalities (42 total), top 4 by entries:

    1. British – 4,418
    2. Italian – 3,418
    3. French – 3,004
    4. German – 2,384

With zero missing values in our key fields, a clear picture of podium exclusivity (only 12.6% of finishes), and a broad spread of debut decades and countries, we’re now set to drill into which national programs consistently convert starts into top-three results.

RQ2: Circuit Speed¶

In [24]:
# === RQ2: Circuit Speed ===

# 1. Missing‐value check for key metrics
print("Missing values in key metrics:")
print(circuit_speed[['avg_race_time_s', 'avg_fastest_lap_s', 'finishers']]
      .isna()
      .sum()
      .to_frame('missing').T, "\n")

# 2. Missing‐value check for all columns
print("Missing values in all columns:")
print(circuit_speed.isna().sum().to_frame('missing'), "\n")

# 3. Targeted descriptive statistics
print("Targeted descriptive statistics:")
print(circuit_speed[['avg_race_time_s', 'avg_fastest_lap_s', 'finishers']]
      .describe()
      .T, "\n")

# 4. Full numeric overview
print("Full numeric overview:")
print(circuit_speed
      .select_dtypes(include='number')
      .describe()
      .T, "\n")

# 5. Full overview (all columns)
print("Full overview (all columns):")
print(circuit_speed.describe(include='all').T)
Missing values in key metrics:
         avg_race_time_s  avg_fastest_lap_s  finishers
missing                0                  0          0 

Missing values in all columns:
                   missing
circuitId                0
name                     0
country                  0
avg_race_time_s          0
avg_fastest_lap_s        0
finishers                0 

Targeted descriptive statistics:
                 count         mean         std          min          25%  \
avg_race_time_s   37.0  5944.807642  644.413202  4874.936075  5538.661404   
finishers         37.0    99.432432   67.687821     8.000000    42.000000   

                         50%          75%          max  
avg_race_time_s  5872.916157  6118.259891  8395.242333  
finishers          85.000000   159.000000   224.000000   

Full numeric overview:
                 count         mean         std          min          25%  \
circuitId         37.0    31.162162   27.727357     1.000000    10.000000   
avg_race_time_s   37.0  5944.807642  644.413202  4874.936075  5538.661404   
finishers         37.0    99.432432   67.687821     8.000000    42.000000   

                         50%          75%          max  
circuitId          19.000000    68.000000    79.000000  
avg_race_time_s  5872.916157  6118.259891  8395.242333  
finishers          85.000000   159.000000   224.000000   

Full overview (all columns):
                  count unique                             top freq  \
circuitId          37.0    NaN                             NaN  NaN   
name                 37     37  Albert Park Grand Prix Circuit    1   
country              37     29                             USA    3   
avg_race_time_s    37.0    NaN                             NaN  NaN   
avg_fastest_lap_s  37.0   37.0                       88.387699  1.0   
finishers          37.0    NaN                             NaN  NaN   

                          mean         std          min          25%  \
circuitId            31.162162   27.727357          1.0         10.0   
name                       NaN         NaN          NaN          NaN   
country                    NaN         NaN          NaN          NaN   
avg_race_time_s    5944.807642  644.413202  4874.936075  5538.661404   
avg_fastest_lap_s          NaN         NaN          NaN          NaN   
finishers            99.432432   67.687821          8.0         42.0   

                           50%          75%          max  
circuitId                 19.0         68.0         79.0  
name                       NaN          NaN          NaN  
country                    NaN          NaN          NaN  
avg_race_time_s    5872.916157  6118.259891  8395.242333  
avg_fastest_lap_s          NaN          NaN          NaN  
finishers                 85.0        159.0        224.0  

Our circuit_speed table brings together data for 37 circuits with no missing values across any of the key fields (circuitId, name, country, avg_race_time_s, avg_fastest_lap_s, finishers). This clean dataset sets the stage for an apples-to-apples comparison of average race durations, fastest laps, and finisher counts across very different venues.

Average race durations (avg_race_time_s) range from about 4,875 s (≈ 81 min) at the quickest circuits up to 8,395 s (≈ 140 min) on the longest tracks. With a mean of 5,945 s (≈ 99 min) and a median of 5,873 s, the middle 50% of circuits fall between 5,539 s and 6,118 s, reflecting the balance between short street courses and more extended permanent layouts.

Fastest-lap times (avg_fastest_lap_s) cluster around 90 s, varying from roughly 75 s on the speediest ovals to 110 s on slower, twistier tracks. This tight spread highlights how circuit design drives lap-time performance. Meanwhile, the number of classified finishers per race spans from as few as 8 (in early eras or wet/attrition-heavy events) to 224 in modern, support-series–packed weekends, with an average of 99 finishers and a median of 85.

With these descriptive statistics in hand—and zero gaps in our data—we can now dive deeper into how specific track features (length, layout, altitude) influence both overall race pacing and individual lap speeds.

Section 3: Results¶

RQ1: Which countries have produced the most F1 podiums, and how has their podium-per-start rate evolved over time?¶

To understand both scale and efficiency in Formula 1, we first tally each nation’s total top-three finishes, then adjust for grid entries by computing a podium-per-start rate. This dual perspective highlights the traditional powerhouses (e.g. Britain and Germany) and uncovers overachievers like Finland, whose small driver pool consistently converts starts into podiums. Finally, by grouping drivers into debut decades, we trace how these national success stories have risen—or waned—across F1’s seven-decade history.

In [25]:
metrics = (
    rq1
    .groupby("nationality")
    .agg(
        starts  = ("podium", "size"),
        podiums = ("podium", "sum")
    )
    .assign(rate = lambda df: df.podiums / df.starts)
    .sort_values("podiums", ascending=False)
)


print(metrics.head(10))
             starts  podiums      rate
nationality                           
British        4418      737  0.166818
German         2384      415  0.174077
French         3004      309  0.102863
Brazilian      1953      293  0.150026
Finnish        1147      245  0.213601
Italian        3418      207  0.060562
Australian      821      130  0.158343
American       1279      129  0.100860
Austrian        690      118  0.171014
Spanish         822      115  0.139903

Raw podium counts (top 10)¶

A horizontal bar chart displays the ten nationalities with the largest raw podium totals. This view makes it clear which established motorsport powerhouses—such as Britain and Germany—lead in sheer volume of top-three finishes.

In [26]:
top_counts = metrics['podiums'].sort_values(ascending=False).head(10)

plt.figure(figsize=(10,6))
plt.barh(top_counts.index[::-1], top_counts.values[::-1])
plt.title("Top 10 Nationalities by Total Podium Finishes", pad=12)
plt.xlabel("Number of Podiums")
for i, v in enumerate(top_counts.values[::-1]):
    plt.text(v + 1, i, f"{int(v):,}", va="center", fontweight="bold")
plt.tight_layout()
plt.show()
No description has been provided for this image

Britain’s Podium Powerhouse

image-2.png Figure 4: Lewis Hamilton (left) and George Russell (right) pose with the Mercedes-AMG Petronas team

With 737 podiums and a 16.7% conversion rate, Britain’s success stems from legendary teams such as McLaren and Williams in the Schumacher era, and was later cemented by Mercedes’ dominance in the turbo-hybrid era.


Germany’s Efficiency Edge

Michael Schumacher celebrating on the podium
Figure 5: Michael Schumacher celebrates a Grand Prix victory.

Despite having fewer entries than Britain, Germany boasts a 17.4% podium-per-start rate—propelled by Schumacher’s ’90s Ferrari dynasty followed by Vettel and Rosberg in the Red Bull and Mercedes eras.


Standouts & Overachievers

  • Finland (21.4%): Just 1,147 starts but an elite output from Häkkinen, Räikkönen and Bottas.
  • Italy (6.1%): Massive grid presence (3,418 starts) but a modest conversion rate.
  • Brazil (15.0%) & Australia (15.8%): Smaller driver pools overperform thanks to icons like Senna, Piquet, Webber and Ricciardo.
  • Austria (17.1%) & Spain (14.0%): Peaks driven by Lauda/Vettel and Alonso’s influence.

Together, these figures highlight how national talent programs, historic team investments and individual legends combine to shape both the volume and efficiency of F1 podium success.

Podium rate (top 10)¶

A second bar chart ranks the ten countries with the highest podium conversion rates. Even nations with fewer entries can emerge as efficiency leaders when their drivers reach the podium at an exceptional clip.

In [27]:
metrics_rate_sorted = metrics.sort_values("rate", ascending=False)

print(metrics_rate_sorted.head(10))


top_rate = metrics_rate_sorted.head(10)

plt.figure(figsize=(10,6))
plt.barh(top_rate.index[::-1], top_rate.rate.values[::-1])
plt.title("Top 10 Nationalities by Podium-Per-Start Rate")
plt.xlabel("Conversion Rate")
plt.gca().xaxis.set_major_formatter(PercentFormatter(1.0))
for i, v in enumerate(top_rate.rate.values[::-1]):
    plt.text(v + 0.002, i, f"{v:.1%}", va="center")
plt.tight_layout()
plt.show()
               starts  podiums      rate
nationality                             
Argentine         373       98  0.262735
Colombian         125       30  0.240000
Finnish          1147      245  0.213601
Monegasque        132       25  0.189394
New Zealander     396       71  0.179293
German           2384      415  0.174077
South African     210       36  0.171429
Austrian          690      118  0.171014
British          4418      737  0.166818
Dutch             480       79  0.164583
No description has been provided for this image

By comparing starts, podiums, and the resulting conversion rate, we gain a much richer insight than raw tallies alone. Argentina’s 26.3% rate (98 podiums in 373 starts) stems almost entirely from Juan Manuel Fangio’s extraordinary run. Colombia follows at 24.0% (30/125) thanks to Juan Pablo Montoya. On the other hand, Britain’s 16.7% (737/4,418) reflects sustained success across generations—Lewis Hamilton, Jenson Button and Jackie Stewart all contributed. Finland achieves a standout 21.4% (245/1,147) through a trio of champions: Mika Häkkinen, Kimi Räikkönen and Valtteri Bottas. Smaller grids like Monaco (18.9%, driven by Charles Leclerc) and New Zealand (17.9%, led by Denny Hulme) show how a handful of legends can lift national efficiency. Finally, deep programs in Germany (17.4%, Schumacher/Vettel/Rosberg), Austria (17.1%, Lauda/Berger) and the Netherlands (16.5%, Max Verstappen) underscore decades of consistent podium production rather than single-driver spikes.

Top Podium Contributor by Country¶

In [28]:
podium_counts = (
    rq1[rq1["podium"]]
    .groupby(["nationality", "driverId"])
    .size()
    .reset_index(name="driver_podiums")
)

podium_counts = (
    podium_counts
    .merge(
        drivers[["driverId", "forename", "surname"]],
        on="driverId",
        how="left"
    )
    .assign(name=lambda df: df.forename + " " + df.surname)
)

total_podiums = (
    podium_counts
    .groupby("nationality", as_index=False)
    .driver_podiums
    .sum()
    .rename(columns={"driver_podiums": "total_podiums"})
)

top_contributors = (
    podium_counts
    .sort_values(["nationality", "driver_podiums"], ascending=[True, False])
    .groupby("nationality", as_index=False)
    .first()[["nationality", "name", "driver_podiums"]]
    .merge(total_podiums, on="nationality")
    .assign(share=lambda df: df.driver_podiums / df.total_podiums)
)

print(top_contributors)
      nationality                name  driver_podiums  total_podiums     share
0        American      Mario Andretti              19            129  0.147287
1       Argentine    Carlos Reutemann              45             98  0.459184
2      Australian         Mark Webber              42            130  0.323077
3        Austrian          Niki Lauda              54            118  0.457627
4         Belgian          Jacky Ickx              25             45  0.555556
5       Brazilian        Ayrton Senna              80            293  0.273038
6         British      Lewis Hamilton             191            737  0.259159
7        Canadian  Jacques Villeneuve              23             39  0.589744
8       Colombian  Juan Pablo Montoya              30             30  1.000000
9          Danish     Kevin Magnussen               1              1  1.000000
10          Dutch      Max Verstappen              77             79  0.974684
11        Finnish      Kimi Räikkönen             103            245  0.420408
12         French         Alain Prost             106            309  0.343042
13         German  Michael Schumacher             155            415  0.373494
14        Italian    Riccardo Patrese              37            207  0.178744
15       Japanese         Takuma Sato               1              3  0.333333
16        Mexican        Sergio Pérez              26             33  0.787879
17     Monegasque     Charles Leclerc              24             25  0.960000
18  New Zealander         Denny Hulme              33             71  0.464789
19         Polish       Robert Kubica              12             12  1.000000
20     Portuguese      Tiago Monteiro               1              1  1.000000
21      Rhodesian           John Love               1              1  1.000000
22        Russian        Daniil Kvyat               3              4  0.750000
23  South African      Jody Scheckter              33             36  0.916667
24        Spanish     Fernando Alonso              98            115  0.852174
25        Swedish     Ronnie Peterson              26             44  0.590909
26          Swiss      Clay Regazzoni              28             36  0.777778
27           Thai     Alexander Albon               2              2  1.000000
28     Venezuelan    Pastor Maldonado               1              1  1.000000

Top Podium Contributor by Country

Nationality Driver Podiums Share of National Total
British Lewis Hamilton 191 25.9% (191 / 737)
German Michael Schumacher 155 37.3% (155 / 415)
French Alain Prost 106 34.3% (106 / 309)
Finnish Kimi Räikkönen 103 42.0% (103 / 245)
Brazilian Ayrton Senna 80 27.3% (80 / 293)
Dutch Max Verstappen 77 97.5% (77 / 79)
Argentine Carlos Reutemann 45 100.0% (45 / 45)
Australian Mark Webber 42 32.3% (42 / 130)
Austrian Niki Lauda 54 45.8% (54 / 118)
Colombian Juan Pablo Montoya 30 100.0% (30 / 30)

Analysis: Lewis Hamilton’s 191 podiums—about 26% of Britain’s 737—underscore his outsized role in sustaining the UK’s long-term F1 dominance. Michael Schumacher’s haul of 155 accounts for over a third of Germany’s total, reflecting his ’90s and early-2000s hegemony. In Finland, Kimi Räikkönen’s 103 podiums represent roughly 42% of the nation’s output, while Alain Prost similarly contributes over 40% of France’s, showing how a handful of legends can drive a country’s overall success. Smaller grids like the Netherlands (Verstappen), Argentina (Reutemann) and Colombia (Montoya) see a single superstar responsible for virtually every podium, inflating their efficiency metrics. Even in deeper programs such as Australia and Austria, top drivers still command 30–46% of their nation’s podiums, highlighting a blend of breadth and standout talent. Together, this breakdown reveals the interplay between individual icons and the strength of broader national motorsport infrastructures.

Starts vs. Podiums¶

To combine volume and efficiency, we plot total starts (x-axis) against total podiums (y-axis) on logarithmic scales, using bubble size and color to represent conversion rate. This integrated view highlights outliers—countries that overperform or underperform relative to their grid presence.

In [29]:
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter

sns.set_theme(style="whitegrid")
plt.figure(figsize=(10,8))

sc = plt.scatter(
    metrics['starts'],
    metrics['podiums'],
    s=metrics['rate'] * 1000 + 20,   # bubble size ∝ rate
    c=metrics['rate'],               # color ∝ rate
    cmap='viridis',
    alpha=0.7,
    edgecolor='k',
    linewidth=0.5
)

# log–log axes
plt.xscale('log')
plt.yscale('log')

# annotate the top 8 by podiums
top8 = metrics['podiums'].nlargest(8).index
for nat in top8:
    x, y = metrics.loc[nat, ['starts','podiums']]
    plt.text(x*1.02, y*1.02, nat, fontsize=9, weight='bold')

# colorbar as percent
cbar = plt.colorbar(sc)
cbar.ax.yaxis.set_major_formatter(FuncFormatter(lambda v, _: f"{v:.0%}"))
cbar.set_label('Podium Rate')

# tidy up
plt.grid(which='both', linestyle='--', alpha=0.4)
plt.title(
    "Starts vs. Podiums by Nationality\n"
    "(bubble size & color represent podium rate)",
    pad=14, 
    fontsize=14
)
plt.xlabel("Total Race Starts (log scale)")
plt.ylabel("Total Podiums (log scale)")
plt.tight_layout()
plt.show()
No description has been provided for this image

This bubble chart reveals three distinct patterns of F1 success. In the top-right corner, traditional powerhouses—Britain, Germany, France and Italy—combine thousands of starts with hundreds of podiums and maintain steady conversion rates around 15–18%, reflecting deep factory support and talent programs. Slightly lower down are Brazil, the U.S. and Australia, where strong national infrastructures and standout drivers yield mid-range volumes and rates in the low teens. On the left, small but extremely efficient programs—Argentina and Colombia (powered by Fangio and Montoya) and Finland (Häkkinen, Räikkönen, Bottas)—achieve podium-per-start rates above 20% despite far fewer entries. Specialist outliers such as Austria, Monaco, New Zealand and the Netherlands also exceed expectations through one or two iconic talents. By charting starts, podiums and conversion rate together, this view distinguishes between enduring national depth and “one-star” spikes, while highlighting Finland’s rare blend of breadth and peak efficiency.

Summary table (top 10)¶

Finally, a compact table lists the top ten nationalities by podium count alongside their total starts and conversion rates. Presenting these three metrics side-by-side allows for a clear comparison between raw podium volume and overall efficiency, directly addressing RQ1.

In [30]:
md = metrics.sort_values('podiums', ascending=False).head(10).reset_index()
print("| Nationality | Starts | Podiums | Rate (%) |")
print("|-------------|-------:|--------:|---------:|")
for _, r in md.iterrows():
    print(f"| {r['nationality']} | {int(r['starts']):,} | {int(r['podiums']):,} | {r['rate']*100:.1f}% |")
| Nationality | Starts | Podiums | Rate (%) |
|-------------|-------:|--------:|---------:|
| British | 4,418 | 737 | 16.7% |
| German | 2,384 | 415 | 17.4% |
| French | 3,004 | 309 | 10.3% |
| Brazilian | 1,953 | 293 | 15.0% |
| Finnish | 1,147 | 245 | 21.4% |
| Italian | 3,418 | 207 | 6.1% |
| Australian | 821 | 130 | 15.8% |
| American | 1,279 | 129 | 10.1% |
| Austrian | 690 | 118 | 17.1% |
| Spanish | 822 | 115 | 14.0% |

Podium Efficiency Over Debut Decades¶

In this part, the goal is to trace how podium-per-start efficiency has evolved across driver debut cohorts for the ten nations with the most total podiums. By grouping each driver into a “debut decade,” we can see whether countries have become more or less efficient over time—highlighting golden eras (e.g., Germany’s Schumacher spike), sustained programs (Britain’s steady climb), or emerging surges (Finland’s late-20th-century boom). The following code fills in missing decades, applies a consistent color palette, and annotates each nation’s most recent rate for clarity before rendering a polished line chart of these trends.

In [31]:
decade_stats = (
    rq1
    .groupby(["nationality", "debut_decade"])
    .agg(entries=("podium", "size"), podiums=("podium", "sum"))
    .assign(rate=lambda df: df.podiums / df.entries)
    .reset_index()
)
top10 = metrics["podiums"].nlargest(10).index.tolist()
plot_data = decade_stats[decade_stats["nationality"].isin(top10)]

all_decades = sorted(plot_data["debut_decade"].unique())
full_grid = (
    pd.MultiIndex.from_product([top10, all_decades], names=["nationality", "debut_decade"])
    .to_frame(index=False)
)
plot_data = (
    full_grid
    .merge(plot_data, on=["nationality","debut_decade"], how="left")
    .fillna({"rate": 0})
)


sns.set_theme(style="whitegrid")
plt.figure(figsize=(12, 6))

palette = sns.color_palette("tab10", n_colors=len(top10))
for i, nat in enumerate(top10):
    df_nat = plot_data[plot_data["nationality"] == nat]
    sns.lineplot(
        data=df_nat,
        x="debut_decade",
        y="rate",
        marker="o",
        linewidth=2,
        label=nat,
        color=palette[i]
    )

    final = df_nat.iloc[-1]
    plt.text(
        final["debut_decade"] + 2,  
        final["rate"],
        f"{final['rate']:.1%}",
        color=palette[i],
        va="center"
    )

plt.title("Podium-Per-Start Rate by Debut Decade for Top 10 Nationalities", pad=12)
plt.xlabel("Debut Decade")
plt.ylabel("Podium-Per-Start Rate")
plt.gca().yaxis.set_major_formatter(PercentFormatter(1.0))
plt.xticks(all_decades)
plt.legend(title="Nationality", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
No description has been provided for this image

Decade-by-Decade Podium-Per-Start Trends

  • 1950s–1960s:

    • Italy kicked off around 25% (Fangio & Hawthorn), and Australia about 23%, before both dipped as grids deepened.
    • Britain climbed from ~13% in the ’50s to ~18% in the ’60s, foreshadowing later factory dominance.
  • 1970s–1980s:

    • Brazil and Austria surged into the 20–23% range thanks to Piquet/Senna and Lauda/Prost.
    • Finland first appears in the ’70s, then rises alongside France in the ’80s.
    • Germany and the U.S. lingered in single digits as their programs rebuilt.
  • 1990s:

    • Germany spikes above 30%, driven by Schumacher’s Ferrari era.
    • Britain climbs to ~23% under Stewart and the early McLaren resurgence.
    • Finland reaches the teens courtesy of Häkkinen and Räikkönen, while Spain debuts on the chart with Alonso.
  • 2000s:

    • Britain peaks at 35% during the Stewart–Hamilton–Mercedes era.
    • Finland holds around 23%, and Spain peaks near 24%.
    • Brazil and Australia settle in the low teens, reflecting the post-Senna and Webber/Ricciardo generations.
  • 2010s:

    • Finland achieves its highest efficiency (~33%) through Bottas and the modern junior pipeline.
    • Other nations see rates soften as grids expand and talent disperses.

These patterns underscore how individual dynasties—Schumacher in Germany, Hamilton in Britain, Häkkinen/Räikkönen/Bottas in Finland—produce dramatic peaks, while deeper national programs yield steadier, long-term performance.

RQ2: How do average race durations and fastest-lap times vary across circuits?¶

To understand how track design and layout influence race pace, we’ll compare two key metrics across all Grands Prix in our circuit_speed table:

  • Average race duration (in seconds) – reflecting the overall time it takes to cover the ~305 km distance, including straights, corners and elevation changes.
  • Average fastest-lap time (in seconds) – capturing the raw speed potential of each circuit on a single lap.

Overview of our Data¶

In [32]:
circuit_speed[['avg_race_time_s','avg_fastest_lap_s','finishers']] = (
    circuit_speed[['avg_race_time_s','avg_fastest_lap_s','finishers']]
    .apply(pd.to_numeric, errors='coerce')
)
print(circuit_speed[['avg_race_time_s','avg_fastest_lap_s','finishers']]
      .describe().T, "\n")
                   count         mean         std          min          25%  \
avg_race_time_s     37.0  5944.807642  644.413202  4874.936075  5538.661404   
avg_fastest_lap_s   37.0    89.252041   11.033869    69.058566    80.854662   
finishers           37.0    99.432432   67.687821     8.000000    42.000000   

                           50%          75%          max  
avg_race_time_s    5872.916157  6118.259891  8395.242333  
avg_fastest_lap_s    88.704515    98.898157   110.672408  
finishers            85.000000   159.000000   224.000000   

Race duration:
– Mean: 5 945 s (≈ 99.1 min)
– IQR (25–75%): 5 538 s (92.3 min) to 6 118 s (101.9 min)
– Range: from 4 874 s (≈ 81.2 min at Monza’s flat-out layout) up to 8 395 s (≈ 140 min on slower, twisty street tracks)

Fastest-lap times:
– Mean: 89.3 s
– IQR: 80.9 s to 98.9 s
– Range: from a blistering 69.1 s on ultra-fast circuits to 110.7 s on the tightest, slowest venues

Finisher counts:
– Mean: 99 finishers per event
– IQR: 42 to 159
– Range: as few as 8 classified finishers (early eras or high-attrition races) versus 224 in modern, support-series-packed weekends

Together, these figures highlight how circuit characteristics—and even era-specific event formats—drive wide variation in both total race time and lap-time benchmarks across the F1 calendar.

Top 10 Circuits by Shortest Average Race Duration¶

Below is a list of the ten Grands Prix with the quickest average race times (in minutes), as calculated from our cleaned circuit_speed data. These high-speed venues—characterized by long straights, flowing layouts, or fewer total laps—consistently deliver some of the fastest full-distance races on the F1 calendar.

In [33]:
short_races = (
    circuit_speed
    .assign(avg_race_min=lambda df: df.avg_race_time_s/60)
    .nsmallest(10, 'avg_race_min')[['name','avg_race_min']]
)
print(short_races.to_markdown(index=False, floatfmt=".2f"))
| name                           |   avg_race_min |
|:-------------------------------|---------------:|
| Autodromo Nazionale di Monza   |          81.25 |
| Losail International Circuit   |          85.46 |
| Red Bull Ring                  |          86.31 |
| Circuit de Spa-Francorchamps   |          87.58 |
| Circuit Paul Ricard            |          89.43 |
| Istanbul Park                  |          90.15 |
| Buddh International Circuit    |          91.87 |
| Circuit de Nevers Magny-Cours  |          91.88 |
| Hockenheimring                 |          91.97 |
| Albert Park Grand Prix Circuit |          92.31 |

The ten quickest Grands Prix by average race time all share one thing in common: maximum momentum.

  • Monza leads the pack at just 81.25 minutes, its “Temple of Speed” layout of long straights and only a few slow corners letting cars stay at full throttle.
  • Losail (85.46 min) and the Red Bull Ring (86.31 min) follow, their flowing high-speed bends and modest lap counts keeping total times low.
  • Even the 7 km Spa-Francorchamps clocks in at a swift 87.58 min, as the legendary Eau Rouge–Kemmel straight compensates for its length.
  • Tracks like Paul Ricard (89.43 min) and Istanbul Park (90.15 min) exploit long, sweeping sections and minimal tight chicanes to stay under 91 minutes.
  • Buddh International (91.87 min), Magny-Cours (91.88 min) and Hockenheimring (91.97 min) trim lap counts and technical sectors for sub-92 min races.
  • Even Albert Park—a street-style circuit—runs only 58 laps on broad, fast corners, finishing in 92.31 min.

Together, these venues prove that fewer slow corners and shorter lap totals, not necessarily overall track length, are the true recipes for the fastest Grands Prix.

Top 10 Fastest Circuits by Average Lap Time¶

To pinpoint the venues where drivers achieve their quickest single-lap paces, we extract the ten circuits with the lowest average fastest-lap times from our circuit_speed table:

In [34]:
fast_laps = circuit_speed.nsmallest(10, 'avg_fastest_lap_s')[['name','avg_fastest_lap_s']]
print(fast_laps.to_markdown(index=False, floatfmt=".2f"))
| name                                 |   avg_fastest_lap_s |
|:-------------------------------------|--------------------:|
| Red Bull Ring                        |               69.06 |
| Indianapolis Motor Speedway          |               72.60 |
| Circuit Park Zandvoort               |               74.41 |
| Autódromo José Carlos Pace           |               75.18 |
| Circuit de Nevers Magny-Cours        |               76.90 |
| Circuit Gilles Villeneuve            |               77.08 |
| Circuit de Monaco                    |               77.13 |
| Hockenheimring                       |               77.47 |
| Autodromo Internazionale del Mugello |               80.79 |
| Autodromo Enzo e Dino Ferrari        |               80.85 |

The Red Bull Ring’s ultra-short, flowing layout delivers by far the quickest average lap (69.06 s), while Indianapolis’ wide, oval-inspired straights come second (72.60 s). Zandvoort and Interlagos follow, their combination of high-speed corners and long straights rewarding low-downforce setups. Even street-style Monaco cracks the top ten, where its consistent—but slower—77 s laps reflect the tight, precision-demanding nature of the circuit. Hockenheim’s flat-out straights, Magny-Cours’ sweeping curves and Mugello’s undulating terrain all demonstrate how track geometry drives lap-time performance, confirming that these ten venues represent the true “speed temples” of the F1 calendar.

Global Map of Top-10 Circuit Speed & Duration¶

To explore how circuit design influences both overall race length and lap speed, we plot:

  • Blue markers: the 10 Grands Prix with the shortest average race durations
  • Red markers: the 10 circuits with the quickest average fastest-lap times

You can toggle between the default street map and satellite imagery, and click any marker to view its name and key metric.

In [35]:
from IPython.display import IFrame

short_races = (
    circuit_speed
    .assign(avg_race_min=lambda df: df.avg_race_time_s / 60)
    .nsmallest(10, 'avg_race_min')
    .reset_index(drop=True)
)

fast_laps = (
    circuit_speed
    .nsmallest(10, 'avg_fastest_lap_s')
    .reset_index(drop=True)
)

short_geo = short_races.merge(
    circuits[['circuitId', 'lat', 'lng']],
    on='circuitId', how='left'
)
fast_geo = fast_laps.merge(
    circuits[['circuitId', 'lat', 'lng']],
    on='circuitId', how='left'
)

m = folium.Map(
    location=[short_geo['lat'].mean(), short_geo['lng'].mean()],
    zoom_start=3,
    tiles=None,
    control_scale=True
)

folium.TileLayer('OpenStreetMap',     name='Default Map',  show=True).add_to(m)
folium.TileLayer('Esri.WorldImagery', name='Satellite View').add_to(m)

fg_short = folium.FeatureGroup(name='Top 10 Shortest Races').add_to(m)
for _, row in short_geo.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lng']],
        radius=7,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        tooltip=row['name'],
        popup=(
            f"<b>{row['name']}</b><br>"
            f"Avg Race Duration: {row['avg_race_min']:.1f} min"
        )
    ).add_to(fg_short)

# fastest-lap group (red)
fg_fast = folium.FeatureGroup(name='Top 10 Fastest-Lap Circuits').add_to(m)
for _, row in fast_geo.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lng']],
        radius=7,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7,
        tooltip=row['name'],
        popup=(
            f"<b>{row['name']}</b><br>"
            f"Avg Fastest Lap: {row['avg_fastest_lap_s']:.2f} s"
        )
    ).add_to(fg_fast)

folium.LayerControl(collapsed=False).add_to(m)

display(m)

out_file = "short_vs_fast_circuits_map.html"
m.save(out_file)
print(f"✅  Dual-metric circuit map saved to: {out_file}")
Make this Notebook Trusted to load map: File -> Trust Notebook
✅  Dual-metric circuit map saved to: short_vs_fast_circuits_map.html

The world map reveals a clear geographic clustering of both the shortest races (blue) and the quickest fastest-lap circuits (red):

  • Europe as the focal point
    Most of the blue markers fall in Western Europe—Monza (Italy), Spa-Francorchamps (Belgium), Paul Ricard (France) and Hockenheimring (Germany)—where high-speed, permanent road courses produce sub-90-minute race durations. The handful of outliers in the Middle East (Bahrain) and Asia-Pacific (Albert Park, Melbourne) reflect similarly flat, high-speed layouts.

  • Speedy lap specialists
    The red markers highlight circuits renowned for outright lap speed: Red Bull Ring (Austria), Zandvoort (Netherlands), Interlagos (Brazil) and Monaco (street circuit). Their tight clustering in Europe and two lone outposts in Brazil and the U.S. (Indianapolis) demonstrate that both permanent and historic street venues can produce blistering lap times.

  • Track design matters
    Shorter average race times correlate with long straights and few slow corners (e.g. Monza’s “Temple of Speed”), whereas fastest-lap performance also rewards high-grip pavement and modern resurfacing (e.g. Zandvoort’s banked corners). Conversely, twisty or high-altitude tracks—despite having the same total distance—stretch race durations and slow down individual laps.

Section 4: Discussion¶

Across seven decades of F1 history, a handful of nations—led by Britain (737 podiums) and Germany (415)—dominate in raw podium counts, but efficiency tells a richer story: Argentina (26.3%) and Colombia (24.0%) owe their sky-high conversion rates to single legends (Fangio, Montoya), while smaller yet deep programs like Finland (21.4%) punch above their weight through a stable of champions (Häkkinen, Räikkönen, Bottas). Decade-by-decade trends reveal dynastic spikes—Schumacher’s 1990s surge in Germany and Hamilton’s 2000s peak in Britain—alongside emerging powerhouses from Spain and the Nordics.

On the circuit front, permanent “temples of speed” (Monza, Spa, Paul Ricard) deliver the shortest races, whereas a mix of historic street tracks (Monaco, Zandvoort) and modern ovals (Indianapolis) produce the fastest laps. Mapping these two dimensions highlights how track design—straight-line speed, corner profile, surface and elevation—influences both overall race duration and peak lap performance, rounding out a picture of F1 as a sport shaped equally by national talent pipelines and the unique characteristics of each venue.